Fix thread safety issues in MLX concurrent inference (Samplers + TokenIterator) #351

sxy-trans-n · 2025-07-07T07:49:35Z

🐛 Problem

The MLX Swift Examples library suffers from multiple thread safety issues when used in concurrent inference scenarios. The issues manifest at two levels:

Sampler Level: CategoricalSampler and TopPSampler race on MLXRandom.globalState
Evaluation Level: TokenIterator instances race on MLX's internal evaluation engine

Error Symptoms

Crash: [eval] Attempting to eval an array without a primitive
Inconsistent results: Different outputs for identical inputs
Memory corruption: Occasional segmentation faults during concurrent sampling
Evaluation errors: MLX internal assertions when multiple asyncEval() calls overlap

Root Cause Analysis

1. Sampler Race Conditions

Both samplers use compiled functions that implicitly access the global random state:

compile(inputs: [MLXRandom.globalState], outputs: [MLXRandom.globalState]) { ... }

2. MLX Evaluation Race Conditions

Multiple TokenIterator instances calling asyncEval() concurrently cause race conditions in MLX's internal evaluation engine, even when samplers are properly synchronized.

When multiple threads call these operations concurrently, they race to access and modify shared state, causing undefined behavior.

🔧 Solution

Added comprehensive thread safety at both the sampler and evaluation levels using NSLock to serialize access to shared MLX resources.

Key Changes

CategoricalSampler: Added randomStateLock to protect global random state access
TopPSampler: Added randomStateLock to protect global random state access
TokenIterator: Added mlxEvalLock to serialize MLX evaluation operations (asyncEval, model.prepare)

Design Decisions

Minimal code changes: No API breaking changes, maintains backward compatibility
Static locks per component: Minimize memory overhead while ensuring proper synchronization
Focused critical sections: Lock only the essential operations to minimize contention

📊 Performance Impact

Sampler Level: Minimal impact (~0.001% overhead) - sampling is <1% of total inference time
TokenIterator Level: Moderate impact - model evaluations are serialized, but throughput remains 3-4x better than pure serial processing

Observed Pattern: 10 concurrent requests complete in 3-4 batches rather than full parallelism, but with 100% stability.

davidkoski · 2025-07-08T16:14:49Z

Libraries/MLXLMCommon/Evaluate.swift

@@ -133,6 +133,8 @@ public struct ArgMaxSampler: LogitSampler {

 /// Sampler that uses `topP` and `temperature` to sample the logits.
 public struct TopPSampler: LogitSampler {
+    private static let randomStateLock = NSLock()


This won't protect it -- randomState is global and this is only protecting callers of TopPSampler. For example it will not guard against concurrent use in CategoricalSampler.

The better way to fix this would be to have random state scoped to the sampler itself, see:

https://swiftpackageindex.com/ml-explore/mlx-swift/main/documentation/mlx/withrandomstate(_:body:)-18ob4

https://github.com/ml-explore/mlx-swift/blob/main/Tests/MLXTests/MLXRandomTests.swift#L237

To use locks, all callers of Random would have to use the same lock. Actually it is more complicated than that -- the calls to globalState are themselves thread safe:

https://swiftpackageindex.com/ml-explore/mlx-swift/main/documentation/mlx/mlxrandom/randomstate

but the calls to evaluate the resulting MLXArrays are not -- you need to guard the eval sites.

davidkoski · 2025-07-08T16:15:43Z

Libraries/MLXLMCommon/Evaluate.swift

@@ -166,6 +168,10 @@ public struct TopPSampler: LogitSampler {
            logits = logits.asType(.float32)
        }

+        // Thread-safe sampling to prevent concurrent access to global random state
+        TopPSampler.randomStateLock.lock()
+        defer { TopPSampler.randomStateLock.unlock() }


FWIW the typical way to use a lock like this is:

lock.withLock { compiledTopPSampling(...) }

but see my other comment on the use of locks to guard this

davidkoski · 2025-07-08T16:21:49Z

Libraries/MLXLMCommon/Evaluate.swift

@@ -267,6 +279,9 @@ public struct RepetitionContext: LogitProcessor {
 ///
 /// Note: this uses `asyncEval()` and there may be an async evaluation running after a call to `next()`.
 public struct TokenIterator: Sequence, IteratorProtocol {
+    // Global lock to protect MLX evaluation operations
+    private static let mlxEvalLock = NSLock()


See:

https://github.com/ml-explore/mlx-swift/blob/main/Source/MLX/Transforms%2BEval.swift#L9

This guards only concurrent calls in TokenIterator. In theory calls to eval() and asyncEval() should be thread safe as long as callers are using entirely distinct MLXArrays / compute graphs. In practice, that was never really a guarantee from mlx::core and in mlx-swift 0.25.1 we found new issues around this (changes on the core side).

The evalLock in mlx-swift is wider than just eval -- it has to guard a number of calls. It may be removed sometime in the future if we can restore the thread safe behavior in mlx::core.

Anyway, that said, this does not guard against concurrent use of the same model (which would still be a problem), nor does it add anything over the lock already in mlx-swift (that I can see, anyway).

davidkoski · 2025-07-08T16:25:53Z

I am curious about the use case where you encountered errors/crashes. I don't think the locks added here are the correct way to protect the state -- they are either too narrow (sampling) or redundant (evals in the iterator). If the use case is multiple threads evaluating the same model, then I don't think these are sufficient.

I do agree with your Problem statement -- there are thread safety concerns here, but I think we need different approaches if these are important to guard against. Many of the threading issues are guarded against, but perhaps not all. Can you please describe how you are encountering these?

sxy-trans-n · 2025-07-09T04:58:31Z

@davidkoski

Thank you for the detailed explanation! This clarifies why our locks are insufficient.

Our Use Case

We're running Swama as an OpenAI HTTP server where multiple concurrent requests hit the same model instance. We currently serialize all model access to avoid crashes, but want to enable parallelism for better throughput.

The Problem

We hit [eval] Attempting to eval an array without a primitive crashes in two scenarios:

Non-deterministic sampling (temp > 0): CategoricalSampler/TopPSampler concurrently access MLXRandom.globalState
Even deterministic sampling (temp = 0): Multiple ArgMaxSampler requests cause concurrent asyncEval() calls to interfere

Questions

Should we refactor samplers to use withRandomState instead of compile(inputs: [MLXRandom.globalState], ...)?
For our HTTP server use case, what's the recommended approach - separate model instances per request or something else?
Are the asyncEval() issues with ArgMaxSampler expected to be handled by MLX-Swift's existing evalLock?

Next Steps

If withRandomState is the right direction, we'd be happy to implement and test the fix in our fork, then submit a PR. We have a stress test that consistently reproduces the issues if that would be helpful.

We appreciate your guidance on the proper solution!

ronaldmannak · 2025-07-09T06:07:14Z

I'm very interested in concurrency support like what Ollama does too. And while there are some Swift-concurrency issues, I wonder if Swift-concurrency is the root cause of the issues you've encountered. I haven't looked in it in detail, but assume state (e.g. KV cache) is the main reason MLX-Swift LM can't handle concurrent requests correctly. Let me know if I'm mistaken here

davidkoski · 2025-07-09T19:51:01Z

If withRandomState is the right direction, we'd be happy to implement and test the fix in our fork, then submit a PR. We have a stress test that consistently reproduces the issues if that would be helpful.

I think that is the right direction -- that will give the samplers independent random state, you just need to make sure that each thread of execution has its own samplers.

I think any issues you can find with the stress test would be awesome.

I believe that:

fully evaluated model weights
independant random state
independant KVCache state

Should give us multithreaded evaluation -- I have done something like this in the past where I had two VLMs running at once. You need to be careful of the prompt processing because that is submitting larger batches and those need to finish before the next piece of work can queue up.

We can also set up some integration tests like this:

https://github.com/ml-explore/mlx-swift-examples/blob/main/Tests/MLXLMTests/EvalTests.swift

That can easily do some multithreaded evaluation and we can use this to show 1) how to do it and 2) make sure it keeps working as expected.

sxy-trans-n · 2025-07-10T15:00:30Z

Thank you for the positive feedback! You're absolutely right about the direction.

This PR implements the concurrent-safe random state you suggested:

Independent random state: withRandomState(MLXRandom.RandomState()) in TopPSampler and CategoricalSampler

Results

This completely solves concurrent sampling crashes at the Evaluate layer. We've added the integration tests you suggested to EvalTests.swift for concurrent evaluation, sampling, and random state isolation.

Remaining Issues

With higher concurrency (3+ requests), we still see occasional Metal layer errors (addCompletedHandler after commit call). These appear to be in MLX-Swift's core Metal management and should be addressed upstream.

Next Steps

I plan to investigate the remaining Metal concurrency issues in mlx-swift itself and would welcome any suggestions or collaboration on that front.

davidkoski · 2025-07-10T15:32:35Z

Libraries/MLXLMCommon/Evaluate.swift

+            logits = logits.asType(.float32)
+        }
+
+        return withRandomState(MLXRandom.RandomState()) {


I wonder if we can make a new property of the sampler, e.g.

let randomState = MLXRandom.RandomState()

and then use a similar compiledTopPSampling that also takes in the randomState as a parameter.

As this is it creates a new RandomState for every call to sample it. This should work, but might be more costly than simply holding the state in the sampler (the sampler should be created for every call to the iterator).

davidkoski · 2025-07-10T15:35:46Z

Tests/MLXLMTests/EvalTests.swift

+            hiddenSize: 64, hiddenLayers: 4, intermediateSize: 128, attentionHeads: 8,
+            rmsNormEps: 0.00001, vocabularySize: 100, kvHeads: 4)
+        let model = LlamaModel(config)
+        quantize(model: model, groupSize: 64, bits: 4)


We may need to eval the model here -- we want to make sure that the weights are all realized at the point where we start using them. To start with they are all promises for random values.

Similar to this in loadWeights:

// apply the loaded weights let parameters = ModuleParameters.unflattened(weights) try model.update(parameters: parameters, verify: [.all]) eval(model)

We aren't applying loaded weights, but the same idea of eval at the end would apply here. It isn't critical for the existing tests because they are not concurrent.

sxy-trans-n added 3 commits July 7, 2025 15:56

Fix thread safety in CategoricalSampler and TopPSampler

2dfd884

Add thread safety to TokenIterator MLX evaluation operations

44befc5

format code

eb345ef

sxy-trans-n mentioned this pull request Jul 7, 2025

Unable to run request in parallel on the same model instance Trans-N-ai/swama#41

Open

Fix MLX concurrency issues by protecting asyncEval with global lock

ed02db9

davidkoski reviewed Jul 8, 2025

View reviewed changes

fix: Add concurrent-safe random state isolation to samplers

550613a

davidkoski reviewed Jul 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix thread safety issues in MLX concurrent inference (Samplers + TokenIterator) #351

Fix thread safety issues in MLX concurrent inference (Samplers + TokenIterator) #351

sxy-trans-n commented Jul 7, 2025

Uh oh!

davidkoski Jul 8, 2025

Uh oh!

davidkoski Jul 8, 2025

Uh oh!

davidkoski Jul 8, 2025

Uh oh!

davidkoski commented Jul 8, 2025

Uh oh!

sxy-trans-n commented Jul 9, 2025

Uh oh!

ronaldmannak commented Jul 9, 2025

Uh oh!

davidkoski commented Jul 9, 2025

Uh oh!

sxy-trans-n commented Jul 10, 2025

Uh oh!

davidkoski Jul 10, 2025

Uh oh!

davidkoski Jul 10, 2025

Uh oh!

Uh oh!

Fix thread safety issues in MLX concurrent inference (Samplers + TokenIterator) #351

Are you sure you want to change the base?

Fix thread safety issues in MLX concurrent inference (Samplers + TokenIterator) #351

Conversation

sxy-trans-n commented Jul 7, 2025

🐛 Problem

Error Symptoms

Root Cause Analysis

1. Sampler Race Conditions

2. MLX Evaluation Race Conditions

🔧 Solution

Key Changes

Design Decisions

📊 Performance Impact

Uh oh!

davidkoski Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

davidkoski Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

davidkoski Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

davidkoski commented Jul 8, 2025

Uh oh!

sxy-trans-n commented Jul 9, 2025

Our Use Case

The Problem

Questions

Next Steps

Uh oh!

ronaldmannak commented Jul 9, 2025

Uh oh!

davidkoski commented Jul 9, 2025

Uh oh!

sxy-trans-n commented Jul 10, 2025

Results

Remaining Issues

Next Steps

Uh oh!

davidkoski Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

davidkoski Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!